Contents

This is part of a bigger study you can find here

1 Load disease expression

The expression data is provided as counts in a tab separate value format. In another file there is the phenoData associated to the samples.

count <- read.delim("../data/ALD_ramon/refseq_counts_ALD.tsv", row.names = 1)
phenoData <- read.csv("../data/ALD_ramon/NGS_AHsteps.design.csv", row.names = 1)

First we will explore a little bit the data, in case we need to correct something:

library("geneplotter")
summary(phenoData)
##      Group         Sample.id        L                               G     
##  E      :12   INTEAM_A01: 1   E      :12   Early.ASH                 :12  
##  C      :11   INTEAM_A02: 1   C      :11   AH.Non.severe             :11  
##  D      :10   INTEAM_A03: 1   D      :10   Explants.from.AH          :10  
##  I      :10   INTEAM_A04: 1   I      :10   Normal.livers             :10  
##  A      : 9   INTEAM_A05: 1   A      : 9   AH.Severe.-.non-responders: 9  
##  B      : 9   INTEAM_A06: 1   B      : 9   AH.Severe.-.responders    : 9  
##  (Other):27   (Other)   :82   (Other):27   (Other)                   :27
phenoData
##     Group  Sample.id L                          G
## A01     A INTEAM_A01 A     AH.Severe.-.responders
## A02     A INTEAM_A02 A     AH.Severe.-.responders
## A03     A INTEAM_A03 A     AH.Severe.-.responders
## A04     A INTEAM_A04 A     AH.Severe.-.responders
## A05     A INTEAM_A05 A     AH.Severe.-.responders
## A06     A INTEAM_A06 A     AH.Severe.-.responders
## A07     A INTEAM_A07 A     AH.Severe.-.responders
## A08     A INTEAM_A08 A     AH.Severe.-.responders
## A09     A INTEAM_A09 A     AH.Severe.-.responders
## B10     B INTEAM_B10 B AH.Severe.-.non-responders
## B11     B INTEAM_B11 B AH.Severe.-.non-responders
## B12     B INTEAM_B12 B AH.Severe.-.non-responders
## B13     B INTEAM_B13 B AH.Severe.-.non-responders
## B14     B INTEAM_B14 B AH.Severe.-.non-responders
## B15     B INTEAM_B15 B AH.Severe.-.non-responders
## B16     B INTEAM_B16 B AH.Severe.-.non-responders
## B17     B INTEAM_B17 B AH.Severe.-.non-responders
## B18     B INTEAM_B18 B AH.Severe.-.non-responders
## C19     C INTEAM_C19 C              AH.Non.severe
## C20     C INTEAM_C20 C              AH.Non.severe
## C21     C INTEAM_C21 C              AH.Non.severe
## C22     C INTEAM_C22 C              AH.Non.severe
## C23     C INTEAM_C23 C              AH.Non.severe
## C24     C INTEAM_C24 C              AH.Non.severe
## C25     C INTEAM_C25 C              AH.Non.severe
## C26     C INTEAM_C26 C              AH.Non.severe
## C27     C INTEAM_C27 C              AH.Non.severe
## C28     C INTEAM_C28 C              AH.Non.severe
## C29     C INTEAM_C29 C              AH.Non.severe
## D30     D INTEAM_D30 D           Explants.from.AH
## D31     D INTEAM_D31 D           Explants.from.AH
## D32     D INTEAM_D32 D           Explants.from.AH
## D34     D INTEAM_D34 D           Explants.from.AH
## D35     D INTEAM_D35 D           Explants.from.AH
## D36     D INTEAM_D36 D           Explants.from.AH
## D37     D INTEAM_D37 D           Explants.from.AH
## D38     D INTEAM_D38 D           Explants.from.AH
## D39     D INTEAM_D39 D           Explants.from.AH
## D40     D INTEAM_D40 D           Explants.from.AH
## E41     E INTEAM_E41 E                  Early.ASH
## E42     E INTEAM_E42 E                  Early.ASH
## E43     E INTEAM_E43 E                  Early.ASH
## E44     E INTEAM_E44 E                  Early.ASH
## E45     E INTEAM_E45 E                  Early.ASH
## E46     E INTEAM_E46 E                  Early.ASH
## E47     E INTEAM_E47 E                  Early.ASH
## E48     E INTEAM_E48 E                  Early.ASH
## E49     E INTEAM_E49 E                  Early.ASH
## E50     E INTEAM_E50 E                  Early.ASH
## E51     E INTEAM_E51 E                  Early.ASH
## E52     E INTEAM_E52 E                  Early.ASH
## F53     F INTEAM_F53 F                Hepatitis.C
## F55     F INTEAM_F55 F                Hepatitis.C
## F56     F INTEAM_F56 F                Hepatitis.C
## F57     F INTEAM_F57 F                Hepatitis.C
## F58     F INTEAM_F58 F                Hepatitis.C
## F59     F INTEAM_F59 F                Hepatitis.C
## F60     F INTEAM_F60 F                Hepatitis.C
## F61     F INTEAM_F61 F                Hepatitis.C
## F62     F INTEAM_F62 F                Hepatitis.C
## G63     G INTEAM_G63 G                       NASH
## G64     G INTEAM_G64 G                       NASH
## G65     G INTEAM_G65 G                       NASH
## G66     G INTEAM_G66 G                       NASH
## G67     G INTEAM_G67 G                       NASH
## G68     G INTEAM_G68 G                       NASH
## G69     G INTEAM_G69 G                       NASH
## G70     G INTEAM_G70 G                       NASH
## G71     G INTEAM_G71 G                       NASH
## H72     H INTEAM_H72 H      Compensated.cirrhosis
## H73     H INTEAM_H73 H      Compensated.cirrhosis
## H74     H INTEAM_H74 H      Compensated.cirrhosis
## H75     H INTEAM_H75 H      Compensated.cirrhosis
## H76     H INTEAM_H76 H      Compensated.cirrhosis
## H77     H INTEAM_H77 H      Compensated.cirrhosis
## H78     H INTEAM_H78 H      Compensated.cirrhosis
## H79     H INTEAM_H79 H      Compensated.cirrhosis
## H80     H INTEAM_H80 H      Compensated.cirrhosis
## I89     I INTEAM_I89 I              Normal.livers
## I90     I INTEAM_I90 I              Normal.livers
## I91     I INTEAM_I91 I              Normal.livers
## I92     I INTEAM_I92 I              Normal.livers
## I93     I INTEAM_I93 I              Normal.livers
## I94     I INTEAM_I94 I              Normal.livers
## I95     I INTEAM_I95 I              Normal.livers
## I96     I INTEAM_I96 I              Normal.livers
## I97     I INTEAM_I97 I              Normal.livers
## I98     I INTEAM_I98 I              Normal.livers
count[1:5, 1:5]
##           A01 A02 A03 A04 A05
## DDX11L1     3   8   0   2   2
## WASH7P     43  82 104  66  82
## MIR6859-3   0   0   0   0   0
## MIR6859-4   0   0   0   0   0
## MIR6859-1   0   0   0   0   0

So we will delete two colums that has the same information as the column G. And we will use more easily readable names:

phenoData <- phenoData[, c("Sample.id", "G")]
colnames(phenoData) <- c("Sample", "Status")
levels(phenoData$Status) <- c("AH", "Non-responders", "Responders", "C.Comp", 
                              "ASH", "Explants", "HCV", "NASH", "Normal")
phenoData
##         Sample         Status
## A01 INTEAM_A01     Responders
## A02 INTEAM_A02     Responders
## A03 INTEAM_A03     Responders
## A04 INTEAM_A04     Responders
## A05 INTEAM_A05     Responders
## A06 INTEAM_A06     Responders
## A07 INTEAM_A07     Responders
## A08 INTEAM_A08     Responders
## A09 INTEAM_A09     Responders
## B10 INTEAM_B10 Non-responders
## B11 INTEAM_B11 Non-responders
## B12 INTEAM_B12 Non-responders
## B13 INTEAM_B13 Non-responders
## B14 INTEAM_B14 Non-responders
## B15 INTEAM_B15 Non-responders
## B16 INTEAM_B16 Non-responders
## B17 INTEAM_B17 Non-responders
## B18 INTEAM_B18 Non-responders
## C19 INTEAM_C19             AH
## C20 INTEAM_C20             AH
## C21 INTEAM_C21             AH
## C22 INTEAM_C22             AH
## C23 INTEAM_C23             AH
## C24 INTEAM_C24             AH
## C25 INTEAM_C25             AH
## C26 INTEAM_C26             AH
## C27 INTEAM_C27             AH
## C28 INTEAM_C28             AH
## C29 INTEAM_C29             AH
## D30 INTEAM_D30       Explants
## D31 INTEAM_D31       Explants
## D32 INTEAM_D32       Explants
## D34 INTEAM_D34       Explants
## D35 INTEAM_D35       Explants
## D36 INTEAM_D36       Explants
## D37 INTEAM_D37       Explants
## D38 INTEAM_D38       Explants
## D39 INTEAM_D39       Explants
## D40 INTEAM_D40       Explants
## E41 INTEAM_E41            ASH
## E42 INTEAM_E42            ASH
## E43 INTEAM_E43            ASH
## E44 INTEAM_E44            ASH
## E45 INTEAM_E45            ASH
## E46 INTEAM_E46            ASH
## E47 INTEAM_E47            ASH
## E48 INTEAM_E48            ASH
## E49 INTEAM_E49            ASH
## E50 INTEAM_E50            ASH
## E51 INTEAM_E51            ASH
## E52 INTEAM_E52            ASH
## F53 INTEAM_F53            HCV
## F55 INTEAM_F55            HCV
## F56 INTEAM_F56            HCV
## F57 INTEAM_F57            HCV
## F58 INTEAM_F58            HCV
## F59 INTEAM_F59            HCV
## F60 INTEAM_F60            HCV
## F61 INTEAM_F61            HCV
## F62 INTEAM_F62            HCV
## G63 INTEAM_G63           NASH
## G64 INTEAM_G64           NASH
## G65 INTEAM_G65           NASH
## G66 INTEAM_G66           NASH
## G67 INTEAM_G67           NASH
## G68 INTEAM_G68           NASH
## G69 INTEAM_G69           NASH
## G70 INTEAM_G70           NASH
## G71 INTEAM_G71           NASH
## H72 INTEAM_H72         C.Comp
## H73 INTEAM_H73         C.Comp
## H74 INTEAM_H74         C.Comp
## H75 INTEAM_H75         C.Comp
## H76 INTEAM_H76         C.Comp
## H77 INTEAM_H77         C.Comp
## H78 INTEAM_H78         C.Comp
## H79 INTEAM_H79         C.Comp
## H80 INTEAM_H80         C.Comp
## I89 INTEAM_I89         Normal
## I90 INTEAM_I90         Normal
## I91 INTEAM_I91         Normal
## I92 INTEAM_I92         Normal
## I93 INTEAM_I93         Normal
## I94 INTEAM_I94         Normal
## I95 INTEAM_I95         Normal
## I96 INTEAM_I96         Normal
## I97 INTEAM_I97         Normal
## I98 INTEAM_I98         Normal

However we are only interested on those related to alcohol disease, so we need to subset them:

alcohol <- !phenoData$Status %in% c("HCV", "NASH", "Explants")
phenoData <- phenoData[alcohol, ]
count <- count[, colnames(count) %in% rownames(phenoData)]

We can calculate the normalization factors used to calculate the Counts per million read:

library("edgeR")
dge <- DGEList(counts = count, group = phenoData$Status)
ord <- order(dge$sample$lib.size)

barplot(dge$sample$lib.size[ord]/1e6, las = 1, ylab = "Millions of reads",
        xlab = "Samples", main = "Library size of the samples", 
        col = droplevels(phenoData$Status))

legend("topleft", legend = levels(droplevels(phenoData$Status)), 
       fill = seq_along(levels(droplevels(phenoData$Status))))

dge <- calcNormFactors(dge)

We can normalize the data to counts per millon of reads taking into account that we know the normalized library size:

expression <- cpm(dge, normalized.lib.sizes = TRUE, log = TRUE,
                  prior.count = 3)

2 Check quality pre-normalizations

We can observe the quality of the samples with a boxplot and the density of expressions:

boxplot(expression, main = "Expression per sample", 
        col = droplevels(phenoData$Status))

multidensity(expression, main = "Densities of expression", legend = NULL)

avgexp <- rowMeans(expression)
hist(avgexp, main = "Histogram of mean expression per gene")
abline(v = 1, col = "red", lwd = 2)

There is a lot of differences between samples on the low expressed genes. We filter those genes that are below 1, as they are unreliable:

expression <- expression[avgexp > 1, ]
multidensity(expression, main = "Densities of expression", legend = NULL)

Now the distribution of the expression between samples is much more comparable. Following the recomendation of the edgeR package we recalculate the normalization factor without the lowly expressed genes:

dge <- calcNormFactors(dge[avgexp > 1, ])
expression <- cpm(dge, normalized.lib.sizes = TRUE, log = TRUE,
                  prior.count = 3)

We can check how has this normalization affected the values:

boxplot(expression, main = "Expression per sample", col = 
        droplevels(phenoData$Status))

multidensity(expression, main = "Densities of expression", legend = NULL)

avgexp <- rowMeans(expression)
hist(avgexp, main = "Histogram of mean expression per gene")

Now are a little bit more comparable.
To work easier with the data I proceed to create an ExpressionSet

library("Biobase")
library("AnnotationDbi")
ALD <- ExpressionSet(as.matrix(expression))
phenoData(ALD) <- AnnotatedDataFrame(droplevels(phenoData))
ALD
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 13366 features, 60 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   rowNames: A01 A02 ... I98 (60 total)
##   varLabels: Sample Status
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

3 Check the normalizations applied

We take advantatge of the plotMDS function of limma package to plot the MDS of the samples, to see how close are between them.

plotMDS(ALD, top = Inf, labels = phenoData(ALD)$Status, main = "Samples relationships")

We can see that they are clearly separated in two groups, but some samples like the Alcohol Hepatitis samples are on both sites of the first dimension. We can also observe that the cirrhosis compensated samples are closer to the normal or healthy samples.

4 Design

We are interested in the progression of the disease, both how does the expression of genes change in each step, and the overall change that leads the disease.

The design we will use to find the differentially expressed genes in each step is the following:

# To ensure sintactically valid names
names <- make.names(levels(phenoData(ALD)$Status))
design <- sapply(names, function(x){
  x <- ifelse(x == "Non.responders", "Non-responders", x)
  ifelse(phenoData(ALD)$Status == x, 1, 0)
})
rownames(design) <- rownames(phenoData(ALD))
design
##     AH Non.responders Responders C.Comp ASH Normal
## A01  0              0          1      0   0      0
## A02  0              0          1      0   0      0
## A03  0              0          1      0   0      0
## A04  0              0          1      0   0      0
## A05  0              0          1      0   0      0
## A06  0              0          1      0   0      0
## A07  0              0          1      0   0      0
## A08  0              0          1      0   0      0
## A09  0              0          1      0   0      0
## B10  0              1          0      0   0      0
## B11  0              1          0      0   0      0
## B12  0              1          0      0   0      0
## B13  0              1          0      0   0      0
## B14  0              1          0      0   0      0
## B15  0              1          0      0   0      0
## B16  0              1          0      0   0      0
## B17  0              1          0      0   0      0
## B18  0              1          0      0   0      0
## C19  1              0          0      0   0      0
## C20  1              0          0      0   0      0
## C21  1              0          0      0   0      0
## C22  1              0          0      0   0      0
## C23  1              0          0      0   0      0
## C24  1              0          0      0   0      0
## C25  1              0          0      0   0      0
## C26  1              0          0      0   0      0
## C27  1              0          0      0   0      0
## C28  1              0          0      0   0      0
## C29  1              0          0      0   0      0
## E41  0              0          0      0   1      0
## E42  0              0          0      0   1      0
## E43  0              0          0      0   1      0
## E44  0              0          0      0   1      0
## E45  0              0          0      0   1      0
## E46  0              0          0      0   1      0
## E47  0              0          0      0   1      0
## E48  0              0          0      0   1      0
## E49  0              0          0      0   1      0
## E50  0              0          0      0   1      0
## E51  0              0          0      0   1      0
## E52  0              0          0      0   1      0
## H72  0              0          0      1   0      0
## H73  0              0          0      1   0      0
## H74  0              0          0      1   0      0
## H75  0              0          0      1   0      0
## H76  0              0          0      1   0      0
## H77  0              0          0      1   0      0
## H78  0              0          0      1   0      0
## H79  0              0          0      1   0      0
## H80  0              0          0      1   0      0
## I89  0              0          0      0   0      1
## I90  0              0          0      0   0      1
## I91  0              0          0      0   0      1
## I92  0              0          0      0   0      1
## I93  0              0          0      0   0      1
## I94  0              0          0      0   0      1
## I95  0              0          0      0   0      1
## I96  0              0          0      0   0      1
## I97  0              0          0      0   0      1
## I98  0              0          0      0   0      1

As you can see each group has its own column to increase the power to the model, and to allow for more informative contrasts. To see the overall change of the disease we can use only one column. We are not interested in other phenotipic change, so how the patient responds to drug is not interesting to us, and it is considered as a single category. In the previous design we will check that Responders and Non-responders are almost equal in gene expression.

refact <- function(x) {
  if (x == "Normal") {
    0
  } else if ( x == "ASH") {
    1
  } else if (x == "C.Comp") {
    2
  } else if (x == "AH") {
    4
  } else if (x == "Responders") {
    5
  } else if (x == "Non-responders") {
    5
  } else {
    stop("Unexpected level")
  }
}
design2 <- sapply(as.character(phenoData(ALD)$Status), refact)
design2 <- as.matrix(design2)
design2 <- cbind(rep(1, ncol(design2)), design2)
colnames(design2) <- c("(Intercept)", "Progression")
rownames(design2) <- rownames(phenoData(ALD))
design2
##     (Intercept) Progression
## A01           1           5
## A02           1           5
## A03           1           5
## A04           1           5
## A05           1           5
## A06           1           5
## A07           1           5
## A08           1           5
## A09           1           5
## B10           1           5
## B11           1           5
## B12           1           5
## B13           1           5
## B14           1           5
## B15           1           5
## B16           1           5
## B17           1           5
## B18           1           5
## C19           1           4
## C20           1           4
## C21           1           4
## C22           1           4
## C23           1           4
## C24           1           4
## C25           1           4
## C26           1           4
## C27           1           4
## C28           1           4
## C29           1           4
## E41           1           1
## E42           1           1
## E43           1           1
## E44           1           1
## E45           1           1
## E46           1           1
## E47           1           1
## E48           1           1
## E49           1           1
## E50           1           1
## E51           1           1
## E52           1           1
## H72           1           2
## H73           1           2
## H74           1           2
## H75           1           2
## H76           1           2
## H77           1           2
## H78           1           2
## H79           1           2
## H80           1           2
## I89           1           0
## I90           1           0
## I91           1           0
## I92           1           0
## I93           1           0
## I94           1           0
## I95           1           0
## I96           1           0
## I97           1           0
## I98           1           0

Normalize the expression by calculating an appropriate observation weight.

v1 <- voom(dge, design, plot = TRUE, normalize.method = "cyclicloess")

v2 <- voom(dge, design2, plot = TRUE, normalize.method = "cyclicloess")

Note that even if we filter those with low expression we assign the correct library size, to avoid over estimation.
We store the normalized expression and the pehnotypic data for uses on the network construction step.

save(v1, v2, phenoData, file = "Expression_ALD.RData")

5 Estimate surrogate variables

We can observe how good they behave

library("sva")
mod0 <- matrix(1, nrow = nrow(design))
pValues <- f.pvalue(v1$E, design, mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")

pValues <- f.pvalue(v2$E, design2, mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")

We can further estimate the number of surrogate variables and the surrogate variables with the package sva:

# Our null model is that all samples express the same
sv1 <- sva(v1$E, design, mod0)
## Number of significant surrogate variables is:  9 
## Iteration (out of 5 ):1  2  3  4  5
head(sv1$sv)
##              [,1]         [,2]          [,3]        [,4]        [,5]
## [1,]  0.005481518 -0.141434253  0.1587707693 -0.06776785  0.01947042
## [2,] -0.094224387  0.013050722  0.1739487832  0.18262989  0.09821819
## [3,]  0.213856473  0.073508014 -0.0910537884  0.05816032 -0.15425493
## [4,] -0.106717321  0.005084984  0.0003759116 -0.02016505 -0.20718822
## [5,]  0.245334581  0.084746576  0.0853614231 -0.07844789 -0.01199353
## [6,] -0.194482272  0.216002809  0.1049633258 -0.20552222 -0.03118649
##             [,6]        [,7]        [,8]         [,9]
## [1,] -0.12280680  0.09382878 -0.13375748  0.018234721
## [2,] -0.12500493  0.02917357  0.31395361 -0.104017796
## [3,] -0.06398647  0.05341420  0.01274342 -0.079779581
## [4,] -0.10887725  0.03853589  0.09170448 -0.005219599
## [5,]  0.13953441 -0.06806961  0.27700219 -0.123476931
## [6,]  0.22603306 -0.05431003  0.07015424  0.149221537
sv2 <- sva(v2$E, design2, mod0)
## Number of significant surrogate variables is:  9 
## Iteration (out of 5 ):1  2  3  4  5
head(sv2$sv)
##             [,1]         [,2]        [,3]        [,4]        [,5]
## [1,]  0.08116204 -0.132977734  0.06674957 -0.07416035  0.01064854
## [2,] -0.06633221 -0.099896707  0.05749157  0.10956878 -0.12377807
## [3,]  0.06428549  0.022845989 -0.16266685  0.04152445  0.18203752
## [4,] -0.05189496  0.006240558  0.07228886 -0.03218212  0.13326773
## [5,]  0.06107009 -0.060973092 -0.17842932  0.02615797  0.04503047
## [6,] -0.22991834 -0.024717621  0.01344964 -0.14268577 -0.12630776
##              [,6]         [,7]         [,8]        [,9]
## [1,] -0.006770487  0.091989131 -0.011010203  0.08789165
## [2,]  0.055690674  0.108265559  0.269158006  0.01913017
## [3,] -0.026099900 -0.043709568  0.004741666 -0.03763857
## [4,] -0.123186222  0.093950042  0.045071074 -0.02029857
## [5,]  0.053279584 -0.009233111  0.157958205  0.11013432
## [6,] -0.161235760  0.051120619  0.063993241 -0.11412902

We can append the estimated surrogated variables to the design matrix for a better estimation of the effect of each paramter:

i.design <- cbind(design, sv1$sv)
colnames(i.design) <- c(colnames(design), paste0("X", 
                                                 1:ncol(sv1$sv)))
i.design2 <- cbind(design2, sv2$sv)
colnames(i.design2) <- c(colnames(design2), 
                         paste0("X", 1:ncol(sv2$sv)))
i.mod0 <- cbind(mod0, sv2$sv)
colnames(i.design2) <- c(colnames(design2), 
                         paste0("X", 1:ncol(sv2$sv)))

We can observe how this adjustment has helped to uniform the p-values. As we would expect a uniform probability of each gene being differentially expressed.

pValues <- f.pvalue(v1$E, i.design, i.mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")

pValues <- f.pvalue(v2$E, i.design2, i.mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")

As we can see the surrogate variables, don’t help to improve the data ajustment, because the distribution of the qValues is not uniform. We fit each design with the expression of the alcoholic liver disease:

fit <- lmFit(v1$E, design)
fit.2 <- lmFit(v2$E, design2)

6 Contrasts

comparisons, in the first design:

contrasts0 <- makeContrasts(
  "ASHvsNormal" = ASH - Normal,
  "CirrhosisVsNormal" = C.Comp - Normal,
  "AHvsNormal" = AH - Normal,
  "RespondersVsNormal" = Responders - Normal,
  "Non.respondersVsNormal" = Non.responders - Normal,
  
  # Assuming that some status are similar
  "SevereVsNormal" = (Responders + Non.responders)/2 - Normal,
  
  "RespondersVsNon.responders" = Responders - Non.responders,
  
  "CirrhosisVsASH" = C.Comp - ASH,
  "AHvsCirrhosis" = AH - C.Comp,
  
  "Non.respondersVsAH" = Non.responders - AH,
  "RespondersVsAH" = Responders - AH,
  "SevereVsHepatitis" = (Non.responders + Responders)/2 - AH,
  
  levels = design)
head(contrasts0)
##                 Contrasts
## Levels           ASHvsNormal CirrhosisVsNormal AHvsNormal
##   AH                       0                 0          1
##   Non.responders           0                 0          0
##   Responders               0                 0          0
##   C.Comp                   0                 1          0
##   ASH                      1                 0          0
##   Normal                  -1                -1         -1
##                 Contrasts
## Levels           RespondersVsNormal Non.respondersVsNormal SevereVsNormal
##   AH                              0                      0            0.0
##   Non.responders                  0                      1            0.5
##   Responders                      1                      0            0.5
##   C.Comp                          0                      0            0.0
##   ASH                             0                      0            0.0
##   Normal                         -1                     -1           -1.0
##                 Contrasts
## Levels           RespondersVsNon.responders CirrhosisVsASH AHvsCirrhosis
##   AH                                      0              0             1
##   Non.responders                         -1              0             0
##   Responders                              1              0             0
##   C.Comp                                  0              1            -1
##   ASH                                     0             -1             0
##   Normal                                  0              0             0
##                 Contrasts
## Levels           Non.respondersVsAH RespondersVsAH SevereVsHepatitis
##   AH                             -1             -1              -1.0
##   Non.responders                  1              0               0.5
##   Responders                      0              1               0.5
##   C.Comp                          0              0               0.0
##   ASH                             0              0               0.0
##   Normal                          0              0               0.0

Given those comparisons of interest we now evaluate the results:

fit2 <- contrasts.fit(fit, contrasts0)
fit2 <- eBayes(fit2)
results <- decideTests(fit2, adjust.method = "BH", lfc = log2(2))
summary(results)
##    ASHvsNormal CirrhosisVsNormal AHvsNormal RespondersVsNormal
## -1         426               498        748               1224
## 0        12570             12343      11767              10776
## 1          370               525        851               1366
##    Non.respondersVsNormal SevereVsNormal RespondersVsNon.responders
## -1                   1351           1281                          0
## 0                   10619          10728                      13366
## 1                    1396           1357                          0
##    CirrhosisVsASH AHvsCirrhosis Non.respondersVsAH RespondersVsAH
## -1            189           512                194            149
## 0           12888         12258              13123          13098
## 1             289           596                 49            119
##    SevereVsHepatitis
## -1               195
## 0              13058
## 1                113
fit2.2 <- eBayes(fit.2) 
results2 <- decideTests(fit2.2, adjust.method = "BH", lfc = log2(2))
summary(results2)
##    (Intercept) Progression
## -1          10          13
## 0          505       13344
## 1        12851           9

We can observe the quality of the t-values over the teoretical quantiles to observe if there is any assumption about the eBayes fitting which doesn’t holds. Note that I already modified the expected proportion of genes in the call to eBayes:

par(mfrow = c(ncol(results), 3))
out <- sapply(colnames(results), function(x){
  qqt(fit2$t[, x], df = fit2$df.total, main = paste("Student's t Q-Q Plot of", x))
  abline(0, 1)
  volcanoplot(fit2, coef = x, main = x)
  plotMD(fit2, coef = x, main = x)
})

We can visualize the same for the other model we have:

par(mfrow = c(1, 3))
qqt(fit2.2$t[, "Progression"], df = fit2.2$df.total, 
    main = "Student's t Q-Q Plot of Progression")
abline(0, 1)
volcanoplot(fit2.2, coef = "Progression", main = "Progression")
plotMD(fit2.2, coef = "Progression", main = "Progression")

Here we can see on the last plot that the expression of the genes show a bimodal distribution in the progression of the dissease. Which seems to confirm that at one point it descompensates and has a very bad prognossis.

par(mfrow = c(1, 2))
plotSA(fit2, main = "First model: Mean−variance trend")
plotSA(fit2.2, main = "Progression model: Mean−variance trend")

7 Store data

It is a good idea to store the data, not only the program of your analysis, so here I go:

save(fit, fit2, fit2.2, fit.2, design, design2, dge, ALD, contrasts0, v1, v2, 
     file = "ALD.RData")

8 Differentially expressed genes

8.1 In stepwise manner

We can plot the top 2000 genes with the highest absolute value of log fold-change in each contrast just for informative purposes:

par(mfrow = c(ncol(contrasts0), 2))
out <- sapply(colnames(contrasts0), function(x) {
  tt.ALD <- topTable(fit2, coef = x, sort.by = "logFC", number = Inf)
  tt.ALD <- tt.ALD[order(-abs(tt.ALD$logFC)), ]
  plot(density(tt.ALD[1:2000, "logFC"]), main = paste("Density of", x))
  hist(tt.ALD[1:2000, "logFC"], main = paste("Histogram of", x), xlab = "logFC")
})

## Progression For the progression model we can plot it too those top 2000 significative at the threshold of 0.05:

par(mfrow = c(1, 2))
tt.ALD2 <- topTable(fit2.2, coef = "Progression", sort.by = "logFC", 
                    number = Inf)
tt.ALD2 <- tt.ALD2[order(-abs(tt.ALD2$logFC)), ]
signif <- tt.ALD2[tt.ALD2$adj.P.Val < 0.05, ] # Subset of significant p-value
plot(density(signif[1:2000, "logFC"]), main = "Density")
hist(signif[1:2000, "logFC"], main = "Histogram", xlab = "logFC")

In order to store them we proceed with:

write.csv(tt.ALD2, file = "ALD_Progression.csv", na = "", row.names = FALSE)

We stored all the table with 13366 genes, even though some are not significant.

These are the genes relevant of the alocholic liver disease progression. There are thus 1034 genes overexpressed in alcoholic liver disease and 966 downregulated in the alcoholic liver disease, which are the main contributers to the disease progression. However the most significant way to see if there is a progression is using another thest the Kendall test: ### Kendall test This test check if a features has a constant grow or decrease among the samples, it is similar to a correlation test.

library("Kendall")
constant_change <- lapply(rownames(v2$E), function(x) {
  Kendall(v2$E[x, ], design2[, "Progression"])})
names(constant_change) <- rownames(v2$E)
head(constant_change)
## $WASH7P
## tau = -0.0209, 2-sided pvalue =0.83392
## 
## $MIR6723
## tau = -0.272, 2-sided pvalue =0.0048395
## 
## $LOC100288069
## tau = -0.0663, 2-sided pvalue =0.49559
## 
## $LINC01128
## tau = -0.48, 2-sided pvalue =6.8098e-07
## 
## $SAMD11
## tau = 0.353, 2-sided pvalue =0.00025594
## 
## $NOC2L
## tau = 0.223, 2-sided pvalue =0.021086
# Extract the appropiate values
kendall_tau <- sapply(constant_change, getElement, name = "tau")
kendall_pvalue <- sapply(constant_change, getElement, name = "sl")

# Store the names of the genes with a constant progression
genes_progression <- names(kendall_tau[kendall_pvalue < 0.05])
up <- names(kendall_tau[kendall_pvalue < 0.05 & kendall_tau > 0L])
down <- names(kendall_tau[kendall_pvalue < 0.05 & kendall_tau < 0L])
# Plot the histogram
hist(kendall_tau[kendall_pvalue < 0.05], xlab = "tau", 
     main = "Histogram of Kendall taus", 
     sub = "Significant with a 0.05 threshold")

Those genes show a trend to increase or decrease with the progression of the disease.

9 Venn diagrams

We can also see which genes are differentially expressed in each phase, and which are shared:

vennDiagram(results[, c(1:3, 6)], include = "up",
            circle.col = c("red", "green", "black", "blue"), 
            main = "Genes Up-regulated in the disease")

vennDiagram(results[, c(1:3, 6)], include = "down",
            circle.col = c("red", "green", "black", "blue"), 
            main = "Genes Down-regulated in the disease")

We can see that there are 299 genes which are differencially expressed significantly with an adjusted p-value below 0.05.

But we can plot it sepparatedly for each step, for a better comparison:

vennDiagram(results[, c(1, 2)], include = "up", 
            circle.col = c("green", "black"), 
            main = "Genes Up-regulated shared")

vennDiagram(results[, c(1, 2)], include = "down", 
            circle.col = c("green", "black"), 
            main = "Genes Down-regulated")


vennDiagram(results[, c(2, 3)], include = "up", 
            circle.col = c("green", "black"), 
            main = "Genes Up-regulated shared")

vennDiagram(results[, c(2, 3)], include = "down", 
            circle.col = c("green", "black"), 
            main = "Genes Down-regulated")


vennDiagram(results[, c(3, 6)], include = "up", 
            circle.col = c("green", "black"), 
            main = "Genes Up-regulated shared")

vennDiagram(results[, c(3, 6)], include = "down", 
            circle.col = c("green", "black"), 
            main = "Genes Down-regulated")

10 Functional enrichment

We can check for functional enrichment, to see if those few significant genes are more related to performing certain functions and processes. I check for pathways using the Reactome data base and the process using Gene Ontologies.

10.1 Reactome

We can see in Reactome which pathways are enriched for those genes, that are significantly differentially expressed.

library("ReactomePA")
## Loading required package: DOSE
## DOSE v3.0.9  For help: https://guangchuangyu.github.io/DOSE
## 
## If you use DOSE in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609
## 
## Attaching package: 'DOSE'
## The following object is masked from 'package:lattice':
## 
##     dotplot
## ReactomePA v1.18.1  For help: https://guangchuangyu.github.io/ReactomePA
## 
## If you use ReactomePA in published research, please cite:
## Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479
library("clusterProfiler")
## clusterProfiler v3.2.10  For help: https://guangchuangyu.github.io/clusterProfiler
## 
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu., Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
DEG_up <- mapIds(org.Hs.eg.db, keys = up, keytype = "SYMBOL", 
                 column = "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
DEG_down <- mapIds(org.Hs.eg.db, keys =  down, keytype = "SYMBOL",
                   column = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
groups <- list(Up = DEG_up, Down = DEG_down)
cc <- compareCluster(groups, fun = "enrichPathway")
dotplot(cc)

We can observe differences in the comparison between genes up-regulated and dow-regulated. But for a more general overview is better to see all the genes involved

DEG_names <- mapIds(org.Hs.eg.db, keys = genes_progression, 
                    keytype = "SYMBOL", column = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
enrich <- enrichPathway(DEG_names, minGSSize = 2, maxGSSize = 2000)
write.csv(as.data.frame(enrich), file = "Reactome_DR.csv", 
          row.names = FALSE, na = "")
dotplot(enrich)

enrichMap(enrich, layout = igraph::layout_nicely,
          vertex.label.cex = 1, n = 15)

Now we can observe which pathways are enriched with those genes, there is a functional difference between the genes up and down regulated.

10.2 GO

We can observe in which biological process they are using topGO:

library("topGO")
topDiffGenes <- function(x) {
  return(x <= 0.05)
}
GOdata.bp <- new("topGOdata",
                 ontology = "BP",
                 description = "Biological process of the signature module.",
                 allGenes = kendall_pvalue,
                 annot = annFUN.org,
                 ID = "symbol",
                 mapping = "org.Hs.eg",
                 geneSel = topDiffGenes,
                 nodeSize = 5)
## 
## Building most specific GOs .....
##  ( 10017 GO terms found. )
## 
## Build GO DAG topology ..........
##  ( 13905 GO terms and 32890 relations. )
## 
## Annotating nodes ...............
##  ( 11178 genes annotated to the GO terms. )
save(GOdata.bp, file = "GO_ALD.RData")
resultFisher <- runTest(GOdata.bp, algorithm = "classic", statistic = "fisher")
## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 7509 nontrivial nodes
##       parameters: 
##           test statistic: fisher
resultKS.weight <- runTest(GOdata.bp, algorithm = 'weight01', statistic = "ks")
## 
##           -- Weight01 Algorithm -- 
## 
##       the algorithm is scoring 7509 nontrivial nodes
##       parameters: 
##           test statistic: ks
##           score order: increasing
## 
##   Level 19:  3 nodes to be scored    (0 eliminated genes)
## 
##   Level 18:  6 nodes to be scored    (0 eliminated genes)
## 
##   Level 17:  10 nodes to be scored   (17 eliminated genes)
## 
##   Level 16:  25 nodes to be scored   (48 eliminated genes)
## 
##   Level 15:  79 nodes to be scored   (105 eliminated genes)
## 
##   Level 14:  164 nodes to be scored  (296 eliminated genes)
## 
##   Level 13:  305 nodes to be scored  (826 eliminated genes)
## 
##   Level 12:  448 nodes to be scored  (1711 eliminated genes)
## 
##   Level 11:  670 nodes to be scored  (3337 eliminated genes)
## 
##   Level 10:  871 nodes to be scored  (4737 eliminated genes)
## 
##   Level 9:   1055 nodes to be scored (6652 eliminated genes)
## 
##   Level 8:   1064 nodes to be scored (7965 eliminated genes)
## 
##   Level 7:   1077 nodes to be scored (9007 eliminated genes)
## 
##   Level 6:   869 nodes to be scored  (9770 eliminated genes)
## 
##   Level 5:   530 nodes to be scored  (10311 eliminated genes)
## 
##   Level 4:   249 nodes to be scored  (10679 eliminated genes)
## 
##   Level 3:   63 nodes to be scored   (10832 eliminated genes)
## 
##   Level 2:   20 nodes to be scored   (10961 eliminated genes)
## 
##   Level 1:   1 nodes to be scored    (11019 eliminated genes)
resultKS.elim <- runTest(GOdata.bp, algorithm = "elim", statistic = "ks")
## 
##           -- Elim Algorithm -- 
## 
##       the algorithm is scoring 7509 nontrivial nodes
##       parameters: 
##           test statistic: ks
##           cutOff: 0.01
##           score order: increasing
## 
##   Level 19:  3 nodes to be scored    (0 eliminated genes)
## 
##   Level 18:  6 nodes to be scored    (0 eliminated genes)
## 
##   Level 17:  10 nodes to be scored   (0 eliminated genes)
## 
##   Level 16:  25 nodes to be scored   (0 eliminated genes)
## 
##   Level 15:  79 nodes to be scored   (0 eliminated genes)
## 
##   Level 14:  164 nodes to be scored  (67 eliminated genes)
## 
##   Level 13:  305 nodes to be scored  (91 eliminated genes)
## 
##   Level 12:  448 nodes to be scored  (365 eliminated genes)
## 
##   Level 11:  670 nodes to be scored  (674 eliminated genes)
## 
##   Level 10:  871 nodes to be scored  (1277 eliminated genes)
## 
##   Level 9:   1055 nodes to be scored (1913 eliminated genes)
## 
##   Level 8:   1064 nodes to be scored (2636 eliminated genes)
## 
##   Level 7:   1077 nodes to be scored (3500 eliminated genes)
## 
##   Level 6:   869 nodes to be scored  (3928 eliminated genes)
## 
##   Level 5:   530 nodes to be scored  (4809 eliminated genes)
## 
##   Level 4:   249 nodes to be scored  (5359 eliminated genes)
## 
##   Level 3:   63 nodes to be scored   (6200 eliminated genes)
## 
##   Level 2:   20 nodes to be scored   (6289 eliminated genes)
## 
##   Level 1:   1 nodes to be scored    (6289 eliminated genes)

allRes <- GenTable(GOdata.bp, classic = resultFisher, weight01 = resultKS.weight,
                   elim = resultKS.elim, orderBy = "weight01", topNodes = 1000)
allRes <- allRes[allRes$weight01 < 0.05, ]
write.csv(allRes, file = "GO_ALD_progression.csv", row.names = FALSE)
head(allRes)
##         GO.ID                                   Term Annotated Significant
## 11 GO:0070125 mitochondrial translational elongation        82          68
## 12 GO:0002576                 platelet degranulation        94          80
## 13 GO:0046951       ketone body biosynthetic process         7           7
## 14 GO:0097267         omega-hydroxylase P450 pathway         9           9
## 15 GO:0019439    aromatic compound catabolic process       369         268
## 16 GO:0008202              steroid metabolic process       221         168
##    Expected Rank in weight01 classic weight01    elim
## 11    57.55               11 0.00597  0.00014 0.00014
## 12    65.97               12 0.00060  0.00018 0.00018
## 13     4.91               13 0.08380  0.00021 0.00021
## 14     6.32               14 0.04125  0.00021 0.00021
## 15   258.97               15 0.16192  0.00024 0.24177
## 16   155.10               16 0.03093  0.00035 6.6e-05
showSigOfNodes(GOdata.bp, score(resultKS.weight), firstSigNodes = 10, useInfo = 'all')
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 131 
## Number of Edges = 228 
## 
## $complete.dag
## [1] "A graph with 131 nodes."
title(main = "GO analysis using Weight01 algorithm", line = -2)

In this plot we can observe the relationship between the top 10 significant gene ontologies, using the Weight01 algorithm.

11 SessionInfo

Here are the packages and the versions used to analyse these data and build this page:

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices datasets  utils    
## [8] methods   base     
## 
## other attached packages:
##  [1] sva_3.22.0           genefilter_1.56.0    mgcv_1.8-16         
##  [4] nlme_3.1-128         edgeR_3.16.4         limma_3.30.6        
##  [7] geneplotter_1.52.0   annotate_1.52.0      XML_3.98-1.5        
## [10] AnnotationDbi_1.36.0 IRanges_2.8.1        S4Vectors_0.12.0    
## [13] lattice_0.20-34      Biobase_2.34.0       BiocGenerics_0.20.0 
## [16] BiocStyle_2.2.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        RColorBrewer_1.1-2 bitops_1.0-6      
##  [4] tools_3.3.2        digest_0.6.10      RSQLite_1.1       
##  [7] evaluate_0.10      memoise_1.0.0      Matrix_1.2-7.1    
## [10] DBI_0.5-1          yaml_2.1.14        stringr_1.1.0     
## [13] knitr_1.15.1       locfit_1.5-9.1     rprojroot_1.1     
## [16] grid_3.3.2         survival_2.40-1    rmarkdown_1.2     
## [19] magrittr_1.5       backports_1.0.4    codetools_0.2-15  
## [22] htmltools_0.3.5    splines_3.3.2      xtable_1.8-2      
## [25] stringi_1.1.2      RCurl_1.95-4.8